Visualising Genomics Data in R/Bioconductor.

In complement to our IGV genome browser course where we reviewed visualising genomics data in a browser, here we will use R/Bioconductor to produce publication quality graphics programatically.

Much of the material will require some familiarity with R and Bioconductor (you can revisit our courses on those here) and these will be used in tight conjunction with tools introduced today such as the Bioconductor package, Gviz.

Reminder of file types

In this session we will be dealing with a range of data types. For more information on file types you can revisit our material.

For more information on visualising genomics data in browsers you can visit our IGV course.

  • IGV.

    Reminder of data types in Bioconductor

We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below.

Materials. - Presentations, source code and practicals.

Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

  • viz_course/presentations/Slides/ Presentations as an HTML slide show.
  • viz_course/presentations/exercises/ Some tasks/examples to work through.

Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived VisualisingGenomicsData folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("/PathToMyDownload/VisualizingGenomicsData/viz_course/presentations/Slides") 
# e.g. setwd("~/Downloads/VisualizingGenomicsData/viz_course/presentations/Slides") 

But we covered this??

We have already discussed on using the IGV browser to review our data and get access to online data repositories.

IGV is quick, user friendly GUI to perform the essential task of review genomics data in its context.

For more information see our course on IGV here.

Visualising Genomics Data around Genomic Features in R (Gviz)

The Gviz packages offers methods to produce publication quality plots of genomics data at genomic features of interest.

To get started using Gviz in some biological examples, first we need to install the package.

## try http:// if https:// URLs are not supported 
source("https://bioconductor.org/biocLite.R") 
biocLite("Gviz") 

Getting started with Gviz – Plotting the axis

Now we have created a GenomeAxisTrack track object we can display the object using plotTracks function.

In order to display a axis track we need to set the limits of the plot (otherwise where would it start and end?).

plotTracks(genomeAxis,from=100,to=10100) 

Getting started with Gviz – Configuring the axis (part-2)

We can also configure the resolution of the axis (albeit rather bluntly) using the littleTicks parameter.

This will add additional axis tick marks between those shown by default.

plotTracks(genomeAxis,from=100,to=10100, 
           littleTicks = TRUE) 

Getting started with Gviz – Configuring the axis (part-3)

By default the plot labels for the genome axis track are alternating below and above the line.

We can further configure the axis labels using the labelPos parameter.

Here we set the labelPos to be always below the axis

plotTracks(genomeAxis,from=100,to=10100, 
           labelPos="below") 

Getting started with Gviz – Configuring the axis (part-4)

In the previous plots we have produced a genomic axis which allows us to consider the position of the features within the linear genome.

In some contexts we may be more interested in relative distances around and between the genomic features being displayed.

We can configure the axis track to give us a relative representative of distance using the scale parameter.

plotTracks(genomeAxis,from=100,to=10100, 
           scale=1,labelPos="below") 

Getting started with Gviz – Configuring the axis (part-4b)

We may want to add only a part of the scale (such as with Google Maps) to allow the reviewer to get a sense of distance.

We can specify how much of the total axis we wish to display as a scale using a value of 0 to 1 representing the proportion of scale to show.

plotTracks(genomeAxis,from=100,to=10100, 
           scale=0.3) 

Getting started with Gviz – Configuring the axis (part-4c)

We can also provide numbers greater than 1 to the scale parameter which will determine, in absolute base pairs, the size of scale to display.

plotTracks(genomeAxis,from=100,to=10100, 
           scale=2500) 

Getting started with Gviz – Axis and Regions of Interest (part-2)

We can add “regions of interest” to the axis plotted by Gviz as we have done with IGV.

To do this we will need to define some ranges to signify the positions of “regions of interest” in the linear context of our genome track.

Since the plots have no apparent context for chromosomes (yet), we will use a IRanges object to specify “regions of interest” as opposed to the genome focused GRanges.

You can see our material here on Bioconductor objects for more information on IRanges and GRanges.

Brief recap (Creating an IRanges)

To create an IRanges object we will load the IRanges library and specify vectors of start and end parameters to the IRanges constructor function.

library(IRanges) 
regionsOfInterest <- IRanges(start=c(140,5140),end=c(2540,7540)) 
names(regionsOfInterest) <- c("ROI_1","ROI_2") 
regionsOfInterest 
## IRanges object with 2 ranges and 0 metadata columns:
##             start       end     width
##         <integer> <integer> <integer>
##   ROI_1       140      2540      2401
##   ROI_2      5140      7540      2401

Getting started with Gviz – Axis and Regions of Interest (part-4)

We include the names specified in the IRanges for the regions of interest within the axis plot by specify the showID parameter to TRUE.

plotTracks(genomeAxis,from=100,to=10100, 
           range=regionsOfInterest, 
           showId=T) 

Plotting regions in Gviz - Data tracks

Lets update our IRanges object to have some score columns in the metadata columns. We can do this with the mcols function as shown in our Bioconductor material.

mcols(regionsOfInterest) <- data.frame(Sample1=c(30,20),
                                       Sample2=c(20,200)) 
regionsOfInterest <- GRanges(seqnames="chr5",
                             ranges = regionsOfInterest) 
regionsOfInterest 
## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames       ranges strand |   Sample1   Sample2
##            <Rle>    <IRanges>  <Rle> | <numeric> <numeric>
##   ROI_1     chr5 [ 140, 2540]      * |        30        20
##   ROI_2     chr5 [5140, 7540]      * |        20       200
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Plotting regions in Gviz - Data tracks

As we have seen, DataTrack objects make use of IRanges/GRanges which are the central workhorse of Bioconductors HTS tools.

This means we can take advantage of the many manipulations available in the Bioconductor tool set.

Lets make use of rtracklayer’s importing tools to retrieve coverage from a bigWig as a GRanges object

library(rtracklayer) 
allChromosomeCoverage <- import.bw("../../Data/small_Sorted_SRR568129.bw",
                                   as="GRanges") 
class(allChromosomeCoverage)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

Plotting regions in Gviz - Data tracks (part 4)

Now we have our coverage as a GRanges object we can create our DataTrack object from this.

Notice we specify the chromsome of interest in the chromosome parameter.

accDT <- DataTrack(allChromosomeCoverage,chomosome="chr5") 
accDT 
## DataTrack 'DataTrack'
## | genome: NA
## | active chromosome: chrM
## | positions: 1
## | samples:1
## | strand: *
## There are 248 additional annotation features on 24 further chromosomes
##   chr1: 1
##   chr10: 1
##   chr11: 1
##   chr12: 1
##   chr13: 1
##   ...
##   chr7: 1
##   chr8: 1
##   chr9: 1
##   chrX: 1
##   chrY: 1
## Call seqlevels(obj) to list all available chromosomes or seqinfo(obj) for more detailed output
## Call chromosome(obj) <- 'chrId' to change the active chromosome

Plotting regions in Gviz - Data tracks (part 6)

We can adjust the type of plots we want using the type argument. Here as with standard plotting we can specify “l” to get a line plot.

plotTracks(accDT, 
           from=134887451,to=134888111, 
           chromosome="chr5",type="l") 

Plotting regions in Gviz - Data tracks (part 7)

Histograms by specifying “h”.

plotTracks(accDT, 
           from=134887451,to=134888111, 
           chromosome="chr5",type="h") 

Plotting regions in Gviz - Data tracks (part 8)

Or filled/smoothed plots using “mountain”.

plotTracks(accDT, 
           from=134887451,to=134888111, 
           chromosome="chr5",type="mountain") 

Plotting regions in Gviz - Additional Parameters.

As with all plotting functions in R, Gviz plots are highly customisable.

Simple features such as point size and colour are easily set as for standard R plots using cex and col paramters.

plotTracks(accDT, 
           from=134887451,to=134888111, 
           chromosome="chr5", 
           col="red",cex=4) 

Putting track togethers - Ordering tracks in plot

The order of tracks in the plot is simply defined by the order they are placed in the vector passed to plotTracks()

plotTracks(c(genomeAxis,accDT), 
           from=134887451,to=134888111, 
           chromosome="chr5" 
           ) 

Putting track togethers - Controling height of tracks in plot

By default, Gviz will try and provide sensible track heights for your plots to best display your data.

The track height can be controlled by providing a vector of relative heights to the sizes parameter of the plotTracks() function.

Adding annotation to plots.

Genomic annotation, such as Gene/Transcript models, play an important part of visualising genomics data in context.

Gviz provides many routes for constructing genomic annotation using the AnnotationTrack() constructor function.

toGroup <- GRanges(seqnames="chr5", 
        IRanges( 
          start=c(10,500,550,2000,2500), 
          end=c(300,800,850,2300,2800) 
        )) 
names(toGroup) <- seq(1,5) 
toGroup 
## GRanges object with 5 ranges and 0 metadata columns:
##     seqnames       ranges strand
##        <Rle>    <IRanges>  <Rle>
##   1     chr5 [  10,  300]      *
##   2     chr5 [ 500,  800]      *
##   3     chr5 [ 550,  850]      *
##   4     chr5 [2000, 2300]      *
##   5     chr5 [2500, 2800]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Adding annotation to plots.

We can see the features are displayed grouped by lines.

But if we want to see the names we must specify the group parameter by using the groupAnnotation argument.

plotTracks(annoT,groupAnnotation = "group") 

Adding annotation to plots. Strands and direction.

When we created the GRanges used here we did not specify any strand information.

strand(toGroup) 
## factor-Rle of length 5 with 1 run
##   Lengths: 5
##   Values : *
## Levels(3): + - *

When plotting annotation without strand a box is used to display features as seen in previous slides

Adding annotation to plots. Strands and direction (part-2).

Now we can specify some strand information for the GRanges and replot.

Arrows now indicate the strand which the features are on.

strand(toGroup) <- c("+","+","*","-","-") 
annoT <- AnnotationTrack(toGroup, 
                group = c("Ann1","Ann1", 
                          "Ann2", 
                          "Ann3","Ann3")) 
 
plotTracks(annoT, groupingAnnotation="group") 

Adding annotation to plots. Controlling the display density

In the IGV course we saw how you could control the display density of certain tracks.

Annotation tracks are often stored in files such as the general feature format (see our previous course).

IGV allows us to control the density of these tracks in the view options by setting to “collapsed”, “expanded” or “squished”.

Whereas “squished” and “expanded” maintains much of the information within the tracks, “collapsed” flattens overlapping features into a single displayed feature.

Adding annotation to plots. Controlling the display density (part 3)

By setting the stacking parameter to “dense”, all overlapping features have been collapsed/flattened

plotTracks(annoT, groupingAnnotation="group",stacking="dense") 

Adding annotation to plots. Setting feature types.

We can set our own feature types for the AnnotationTrack object using the same feature() function.

We can choose any feature types we wish to define.

feature(annoT) <- c(rep("Good",4),rep("Bad",2)) 
feature(annoT) 
## [1] "Good" "Good" "Good" "Good" "Bad"  "Bad"

GeneRegionTrack

We have seen how we can display complex annotation using the AnnotationTrack objects.

For gene models Gviz contains a more specialised object, the GeneRegionTrack object.

The GeneRegionTrack object contains additional parameters and display options specific for the display of gene models.

Lets start by looking at the small gene model set stored in the Gviz package.

data(geneModels) 
geneModels[1,]
##   chromosome    start      end width strand feature            gene
## 1       chr7 26591441 26591829   389      + lincRNA ENSG00000233760
##              exon      transcript     symbol
## 1 ENSE00001693369 ENST00000420912 AC004947.2

GeneRegionTrack - Setting up the gene model track.

We can define a GeneRegionTrack as we would all other tracktypes. Here we provide a genome name, chromosome of interest and a name for the track.

grtrack <- GeneRegionTrack(geneModels, genome = "hg19", 
                           chromosome = "chr7", 
                           name = "smallRegions") 
plotTracks(grtrack) 

GeneRegionTrack - Setting up the gene model track.

plotTracks(grtrack) 

We can see that features here are rendered slightly differently to those in an AnnotationTrack object.

Here direction is illustrated by arrows in introns and unstranslated regions are shown as narrower boxes.

GeneRegionTrack - Specialised labelling.

Similarly we can label transcripts by their individual transcript names.

plotTracks(grtrack,transcriptAnnotation="transcript") 

GeneRegionTrack - Specialised labelling.

Or we can label using the transcriptAnnotation object by any arbitary column where there is one level per transcript.

plotTracks(grtrack,transcriptAnnotation="symbol") 

GeneRegionTrack - Specialised labelling of exons.

As with transcripts we can label individual features using the exonAnnotation parameter by any arbitary column where there is one level per feature/exon.

plotTracks(grtrack,exonAnnotation="exon",
           from=26677490,to=26686889,cex=0.5) 

GeneRegionTrack - Specialized display density for gene models.

We saw that we can control the display density when plotting AnnotationTrack objects.

We can control the display density of GeneRegionTracks in the same manner.

plotTracks(grtrack, stacking="dense") 

GeneRegionTrack - Specialized display density for gene models.

However, since the GeneRegionTrack object is a special class of the AnnotationTrack object we have special parameter for dealing with display density of transcripts.

The collapseTranscripts parameter allows us a finer degree of control than that seen with stacking parameter. Here we set collapseTranscripts to be TRUE inorder to merge all overlapping transcripts.

plotTracks(grtrack, collapseTranscripts=T, 
           transcriptAnnotation = "symbol") 

GeneRegionTrack - Specialized display density for gene models.

Collapsing using the collapseTranscripts has summarised our transcripts into their respective gene boundaries.

We have however lost information on the strand of transcripts. To retain this information we need to specify a new shape for our plots using the shape parameter. To capture direction we use the “arrow” shape

plotTracks(grtrack, collapseTranscripts=T, 
           transcriptAnnotation = "symbol", 
           shape="arrow") 

GeneRegionTrack - Specialized display density for gene models.

The collapseTranscripts function also allows us some additional options by which to collapse our transcripts.

These methods maintain the intron information in the gene model and so get us closer to reproducing the “collapsed” feature in IGV.

Here we may collapse transcripts to the “longest”.

plotTracks(grtrack, collapseTranscripts="longest", 
           transcriptAnnotation = "symbol") 

GeneRegionTrack - Building your own gene models.

We have seen in previous material how gene models are organised in Bioconductor using the TxDB objects.

Gviz may be used in junction with TxDB objects to construct the GeneRegionTrack objects.

We saw in the Bioconductor and ChIPseq course that many genomes have pre-build gene annotation within the respective TxDB libraries. Here we will load a TxDb for hg19 from the TxDb.Hsapiens.UCSC.hg19.knownGene library.

library(TxDb.Hsapiens.UCSC.hg19.knownGene) 
 
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene 
txdb 
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

GeneRegionTrack - Building your own gene models from a TxDb.

With our new GeneRegionTrack we can now reproduce the gene models using the Bioconductor TxDb annotation.

Here the annotation is different but transcripts overlapping uc003syc are our SKAP2 gene.

plotTracks(customFromTxDb, 
           from=26591341,to=27034958, 
           transcriptAnnotation="gene") 

GeneRegionTrack - Building your own gene models from a GFF.

Now by combining the ability to create our own TxDb objects from GFFs we can create a very custom GeneRegionTrack from a GFF file. Here i use the UCSC table browser to extract a GTF of interest.

library(GenomicFeatures) 
ensembleGTF <- "~/Downloads/hg19.gtf.gz"
txdbFromGFF <- makeTxDbFromGFF(file = ensembleGTF) 
customFromTxDb <- GeneRegionTrack(txdbFromGFF,chromosome="chr7") 
plotTracks(customFromTxDb, 
           from=26591341,to=27034958, 
           transcriptAnnotation="gene") 

Exercises

Time for exercises! Link here


Solutions

Time for solutions! Link here